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ABSTRACT 

The huge amount of available biological expression level data sets all over the world requires 
automated techniques that help scientists to analyze, understand, and cluster protein structure 
prediction, multiple alignment of sequences, phylogenic inferences etc. Major hurdles at this point 
include computational complexity and reliability of the searching algorithms, artificial neural networks 
a Soft Computing Paradigm widely used in analyzing gene expression level data produced by 
microarray technology on tlte genomic scale, sequencing genes on the genomic scale, sequencing 
proteins and amino acids, etc because of its features such as strong capacity of nonlinear mapping, high 
accuracy for learning, and good robustness. ANN be trained from examples without the need for a 
thorough understanding of the task in hand, and able to show surprising generalization performance 
and predicting power In this paper we review some of the artificial neural networks and related pattern 
recognition techniques that have been used in this area. 

General Terms: Artificial Neural Networks, Architectures, Learning Algorithms, Topologies, 
Biological Sequence. 



I. INTRODUCTION 

Advancement in the genomics arena like The Human Genome Project (HGP), Microarray Data 
Analysis can produce large amounts of expression level data by parallel processing. Microarray Data Analysis 
aims to measure mRNA levels in particular cells or tissues for many genes at once that relies on the 
hybridization properties of nucleic acids on a genomic scale. Microarrays have given a new direction towards 
the study of genome by allowing researchers to study the expression of thousands of genes simultaneously for 
the first time. It is being predicted that this is essential to understanding the role of genes in various biological 
functions. 

The action of discovering a motif (a short DNA or protein sequence) of gene expression is closely 
related to correlating sequences of genes to specific biological functions, helps in understanding the role of 
genes in biological functions. 

The most challenging area of Computational Biology is to study thousands of genes across diverse 
conditions simultaneously. New computational and data mining techniques need to be developed for low-cost, 
low precision (approximate), good solutions in order to properly comprehend and interpret expression. 
Gene Expression Analysis includes Differential analysis/marker selection for searching genes that are 
differentially expressed in distinct phenotypes, Class prediction (supervised learning), Class discovery 
(unsupervised learning), Pathway analysis(search for sets of genes differentially expressed in distinct 
phenotypes) and data conversion tasks, such as filtering and normalizing, which are standard prerequisites for 
genomic data analysis. 

Artificial Neural Network (ANNs) inspired by Biological Neural Network (BNNs) and Statistical 
Learning are widely used pattern recognition techniques in microarray data analysis. ANN can be trained to 
identify the relevant genes that are used to make this distinction which is also distinguishing between 
complementary DNA (cDNA) microarray data of two types of colorectal lesions. 

We have surveyed a number of Artificial Neural Network (ANNs), some important learning algorithms 
and related techniques that help (have been used) in discovering or correlating patterns of expression data (e.g. 
microarray data) to biological function. We have given a chronological survey of calculating number of hidden 
neurons in neural network. In this paper you will find a summary of our survey along with basic concepts of 
ANNs and some important learning algorithms. 
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Artificial Neural Networks (ANNs) is unlike traditional computing (algorithmic approach) which 
require prior instruction to solve a problem, belong to the adaptive class of techniques in the machine learning 
arena. ANNS are used as a solution to various nonlinear complex problems; however, their success as an 
intelligent pattern recognition methodology has been most prominently advertised. 

ANNs were inspired by the biological neural networks, whose performance is analogous to the most 
basic functions of human brains. Most models of ANNs are organized in the form of a number of processing 
units (also called artificial neurons, or simply neurons [1]), and a number of weighted connections (artificial 
synapses) between the neurons. 

Though artificial neural networks are not an exact copy of biological human brain, it is important to 
begin with understanding fundamental concepts of Organization of the Nervous System and Brain. 

Nervous System: 

The human nervous system can be broken down into three stages that may be represented in block diagram form 
as: 
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The receptors collect information from the environment (e.g. photons on the retina), The effectors 
generate interactions with the environment (e.g. activate muscles), The flow of information/activation is 
represented by arrows - feedforward and feedback. Naturally, this paper will be primarily concerned with how 
the neural network in the middle works. 




The majority of neurons encode their activations or outputs as a series of brief electrical pulses (i.e. 
spikes or action potentials). When sufficient input is received (i.e. a threshold is exceeded), the neuron generates 
an action potential or 'spike' (i.e. it 'fires'). 

That action potential is transmitted along the axon to other neurons, or to structures outside the nervous 
systems (e.g., muscles). 

The neuron's cell body (soma) processes the incoming activations and converts them into output 
activations by summation of incoming signals (spatially and temporally). 

The neuron's nucleus contains the genetic material in the form of DNA. This exists in most types of cells, not 
just neurons. 

Dendrites are fibres which emanate from the cell body and provide the receptive zones that receive 
activation from other neurons. Signals from connected neurons are collected by the dendrites. 

Axons are fibres acting as transmission lines that send activation to other neurons. 
The junctions that allow signal transmission between the axons and dendrites are called synapses. The process 
of transmission is by diffusion of chemicals called neurotransmitters across the synaptic cleft. 
If sufficient input is not received (i.e. the threshold is not exceeded), the inputs quickly decay and no action 
potential is generated. 
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Timing is clearly important - input signals must arrive together strong inputs will generate more action 
potentials per unit time. 

The process of building an ANN, similar to its biological inspiration, involves a learning episode (also 
called training). During learning episode, the Network requires a massive number of associative mappings to 
observe a sequence of recorded data, and adjusts the strength of its synapses according to a learning algorithm 
and based on the observed data, so that a particular input leads to a specific target output. 

Training a neural network is the process of finding a set of weight and bias values so that for a given 
set of inputs, the outputs produced by the neural network are very close to some target values. 

A neural network can be thought of as a complicated mathematical function that has various constants 
called weights and biases, which must be determined. The process of finding the set of weights and bias values 
that best match your existing data is called training the neural network. There are many ways to train a neural 
network. 

Learning algorithms are generally divided into two types, supervised and unsupervised. 

Input Neuron with bias 




a=fiwp+b) 

The scalar input p is transmitted through a connection that multiplies its strength by the scalar weight 
w, to form the product wp, again a scalar, b is a bias value allows you to shift the activation function to the left 
or right, which may be critical for successful learning. The activation function of a node defines the output of 
that node given an input or set of inputs. Neurons are "switches" that output a "1" when they are sufficiently 
Activated, and '0' when not. Here the weighted input wp + b is the argument of the activation function /, 
typically a step function or a sigmoid function, which takes the argument n and produces the scalar output a. 
The bias is much like a weight, except that it has a constant input of 1 . [2] 



Depending upon the problem variety of Activation function is used 

1 . Linear Activation function like step function 

2. Nonlinear Activation function like sigmoid function 

Using a nonlinear function which approximates a linear threshold allows a network to approximate nonlinear 
functions using only small number of nodes. 
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A list of common transfer function (acti vation functions) and their name s can be found in following Table 
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Training the network nothing but adjusting both w and b scalar parameters of the neuron itself so that the cost 
functions is minimized and to exhibits some desired or interesting behavior. 

Cost function: 

C = (I(x,-xr) 2 )/N 

Where Xj is desired output and x,'s are the output of the neural network. 

The supervised algorithms require labelled number of hidden neurons in neural network number of hidden 
neurons in neural network training data and external agent for supervision. In other words, they require more a 
prior knowledge about the training set. 

The most important, and attractive, feature of ANNs is their capability of learning (generalizing) from 
example (Extracting knowledge from data) and adapt dynamically to the changing system. ANNs can do this 
without any pre-specified rules that define intelligence or represent an expert's knowledge. This feature makes 
ANNs a very popular choice for gene expression analysis and sequencing. 

ANNs can select its own relevant input variable from the available data, and the model is developed 
subsequently which can in turn greatly increase the expert's knowledge and understanding of the problem. 

In this paper, we discuss some important ANN like Feedforward Neural Networks and 
Backpropagation Neural Networks. 

Feedforward Neural Network - The feedforward neural network consist of a series of layers. In this 
network the information moves in only one direction — forwards: From the input layer data goes through the 
hidden layer (if any) and to the output layer. There are no cycles or loops in the network. Any finite input-output 
mapping is possible with one hidden layer and enough neurons in it. 

Hi 
Hi 



Input Neurons ' Hidden Neurons Output Neurons 




The network consists of neurons distributed in three different layers, each of which computes a 
function (called an activation function) of the inputs carried on the in -edges (edges pointing into the neuron) 
and sends the output on its out-edges (edges pointing out of the neuron). The inputs and outputs are weighed 
by weights W; and shifted by bias factor specific to each neuron. It has been shown that for certain neural 
network topologies, any continuous function can be accurately approximated by some set of weights and biases. 
Therefore, we would like to have an algorithm which when given a function/, learns a set of weights and biases 
which accurately approximate the function. For feed-forward neural networks (artificial neural networks 
where the topology graph does not contain any directed cycles), the back -propagation algorithm described later 
does exactly that. 

Output of a single neuron can be calculated as follows 

Output = A ( X W„ * I n + bias ) where A is the activation function of the neuron, W n is the weight of the n th in- 
edge, In is the input carried across the n th in-edge, and bias is the bias of the neuron. 

Weight of a neuron can be calculated from the training data a pairs of sets of inputs and outputs (Xj,Yj) where X t 
is the input to all input neurons and Y; is the output of the neural network after running the inputs in Xj. 

The entire training dataset is D=Union ((X ; , Y ; )) 
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We need to find the weights of edges and biases of neurons in the network which minimize the sum of squares 
error (SSE) of the training data. The SSE of the network with respect to the training data D is computed using 
the formula, 

SSE(D) =Sum((Yi - Zf) 

Feed-forward networks cannot perform temporal computation. More complex networks with internal 
feedback paths are required for temporal behavior. 

Back propagation neural networks 

The weights and biases are initially assigned to a random continuous (real) value between -1 and 1. The 
algorithm trains the neural network incrementally, meaning that each instance is considered independently, and 
after considering a training instance, the weights and biases are updated before considering another one. The 
update is done in a way that performs a gradient descent minimization of the error with respect to the weights 
and biases. 




Input Neurons Hidden Neurons Output Neurons 

The learning algorithm can be split into three phases for an instance (X ; , Y;) for r= 1,2,3,. . .. 

Calculate Output - Calculate Z i5 the output of the network for the training instance's input set X t . 

Calculate Blame - Y ; is the desired value for Z ; , so when they are different, there is some error. From this error, 
we wish to compute a blame value for each neuron. Blames will later be used to decide by how much to adjust 
weights and biases. The blames are just the linear error. Blame can be shared with the help of Delta Rule. 

AWji = n * 5j* x, 

Sj U > > *f(hj) 

Where r| is the learning rate of the neural network, tj and yj are targeted and actual output of the j th neuron, hj is 
the weighted sum of the neuron's inputs and f is the derivate of the activation function f. 

Problem with Multilayer Network is that we don't know the targeted output value for the Hidden layer neurons. 

This can be solved by a trick: 

5, = I ( 5 k * Wki ) * f (hi) 

The first factor in parenthesis involving the sum over k is an approximation to (t r ai) for the hidden layers when 
we don't know tj. 

Calculate Blame for Output Neurons - The blame of each output neuron is calculated as Y r Zj. The blame should 
be proportional to partial derivatives for the gradient to minimize the sum squared error of the whole network. 

Calculate Blame for Input and Hidden Neurons - The blame for an input or hidden neuron is calculated in terms 
of the blames of its out-neurons (the ones that it is pointing to). Because of this dependency, the blames need to 
be calculated in a topologically sorted order of the reverse of the network topology graph. This is similar to the 
order in which the outputs of neurons are calculated except that the topology graph must first be reversed before 
topologically sorting. The formula for the blame of an input or hidden neuron is £ W k * E k where E k is the 
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blame of an out-neuron and Wk is the weight of the edge that connects to the out-neuron. This step of 
propagating the error backwards is why the learning algorithm is called back -propagation. 

Adjust Weights and Biases - This last step performs the actual gradient descent by adjusting the weights with 
the formula: Wy = + r * Ej * Aj'(Ij) * O; where r is the learning rate, Ej is the blame of neuron j, Aj' is the 
derivative of neuron j's activation function, Ij is the input fed to neuron j during calculation of the output in the 
first step, and Oj is the output of neuron i during the first step. The biases are similarly adjusted using the 
formula: bias_i = bias_i + r*E ; . 

Learning completes with the all training instances are run. To achieve better accuracy, we may wish to run the 
training data multiple times through the learning algorithm without over fitting. 

Neural networks need a lot of parameters like NN architecture, number of hidden neurons in hidden layer, 
activation function, inputs, and updating of weights etc and many of these parameters can significantly affect the 
accuracy, stability and runtime of neural networks. The backpropagation algorithm, one can only expect the 
algorithm to have good performance with a finite training set. 

A fundamental question that is often raised in the applications of neural networks is "how large does the 
network have to be to perform a desired task? ". Answers to this question are directly related to the capability of 
neural networks, and should be given independently of the learning algorithms employed. 

Number of Hidden Neurons 

It is crucial to pick a proper neural network topology from infinite set of directed acyclic graphs before 
training with some restriction is that the number of input nodes must match the number of inputs to the function 
and the number of output nodes must match the number of outputs of the function. But we are free to choose 
any topology for the hidden neurons as long as there is a path from each input neuron to some output neuron and 
vice versa. 

The fixing of the hidden neuron is an important for the designing of Neural Network model. The 
random selection of hidden neuron may cause over fitting and under fitting problem in the network. Several 
researchers tried and proposed many approach for fixing hidden neuron. The survey has been made to find the 
number of hidden neurons in neural network is and described in a chronological manner. 



Year 


Method/Concept 


Reference 


1991 


exactly solves the multilayer neural network training problem, for 
any arbitrary training set with bounds on the size of a multilayer 
neural network to exactly implement an arbitrary training set; [3] 


Michael A. Sartori 
and Panos J. 
Antsaklis 




Bounds on the Number of Hidden Neurons in Multilayer Perceptrons 


Shih-Chi Huang 
and Yih-Fang 
Huang 


1993 


Proposed two parallel hyper plane methods for finding the number of 
hidden neurons. The 2" / 3 hidden neurons are sufficient for this 
design of the network. [4] 


Arai 


1995 


Investigated the estimation theory to find the number of hidden units 
in the higher order feedforward neural network. This theory is 
applied to the time series prediction. The determination of an optimal 
number of hidden neurons is obtained when the sufficient number of 
hidden neurons is assumed. According to the estimation theory, the 
sufficient number of hidden units in the second-order neural network 
and the first-order neural networks are 4 and 7, respectively. The 
simulation results show that the second-order neural network is 
better than the first-order in training convergence. According to that, 
the network with few nodes in the hidden layer will not be powerful 
for most applications. The drawback is long training and testing 
time. [5] 


Li et al. 


1996 


presented another method to find an optimal number of hidden units. 
The drawback is that there is no guarantee that the network with a 
given number of hidden units will find the correct weights. 
According to the statistical behaviours of the output of the hidden 
units, if a network has large number of hidden nodes, a linear 


Hagiwara 
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relationship is obtained in the hidden nodes. [6] 



1997 Developed a method to fix hidden neuron with negligible error based 
on Akaike's information criteria. The number of hidden neurons in 
three layer neural network is N -1 and four-layer neural network is 

N/2 + 3 where N is the input-target relation[7] 

1998 Proposed a statistical estimation of number of hidden neurons. The 
merits are speed learning. The number of hidden neurons mainly 
depends on the output error. The estimation theory is constructed by 
adding hidden neurons one by one. The number of hidden neurons is 
formulated as N h = K Log II P C ZI I/Log S. where S is total number of 
candidates that are randomly searched for optimum hidden unit c is 

allowable error. [8] 



Tamura 
Tateishi 



Fujita 



and 



1998 Upper bounds on the number of hidden neurons in feed forward 
networks with arbitrary bounded nonlinear activation functions - 
This paper used the fact what Sartori and Antsaklis pointed out in 
1991, the sigmoid function is not necessary to be used as non- 
linearity of the neurons and . Huang and Babri showed that standard 
single-hidden layer feed forward networks (SLFN) with not more 
than N hidden neurons and with any bounded non linear activation 
function which has a limit at one infinity can learn these N distinct 

samples with zero error. 

1999 Presented a method to determine the number of hidden units which 
are applied in the prediction of cancer cells. Normally, training starts 
with many hidden units and then prune the network once it has 
trained. However pruning does not always improve generalization. 
The initial weights for input to hidden layer and the number of 
hidden units are determined automatically. The demerit is no optimal 

solution. [9] 



Guang-Bin 
Huang, Haroon A. 
Babri 



Keeni et al 



2001 Presented a statistical approach to find the optimal number of hidden 
units in prediction applications. The minimal errors are obtained by 
the increase of number of hidden units. Md. Islam and Murase 
proposed a large number of hidden nodes in weight freezing of single 
hidden layer networks. The generalization ability of network may be 
degraded when the number of hidden nodes(N h ) is large because N th 
hidden node may have some spurious connections. [10] 



Onoda, 

Md. Islam and 
Murase 



2003 Implemented a set covering algorithm (SCA) in three-layer neural 
network. The SCA is based on unit sphere covering (USC) of 
hamming space. This methodology is based on the number of inputs. 
Theoretically the number of hidden neurons is estimated by random 
search. The output error decreases with N h being added. The N h is 
significant in characterizing the performance of the network. The 
number of hidden neurons should not be too large for heuristic 
learning system. The N h found in set covering algorithm 
is 3L/2 hidden neurons where L is the number of unit spheres 
contained in N dimensional hidden space [1 1] 



Zhang et al 



2003 developed the model for learning and storage capacity of two- 
hidden-layer feedforward network. In 2003, Huang [12] formulated 
the following: in single hidden layer and in two hidden layer. 



Huang 



2004 A three-layer neural network (NN) with novel adaptive architecture 
has been developed. The hidden layer of the network consists of 
slabs of single neuron models, where neurons within a slab — but not 
between slabs — have the same type of activation function. The 
network activation functions in all three layers have adaptable 
parameters. 



Panayiota Poirazi, 
Costas Neocleous, 
Costantinos S. 
Pattichis 



2006 



2006 



developed a separate learning algorithm which includes a 
deterministic and heuristic approach. In this algorithm, hidden-to- 
output and input-to-hidden nodes are separately trained. It solved the 
local minima in two-layered feedforward network. The achievement 
is best convergence speed [14] 

Estimating the Number of Hidden Neurons in a Feedforward 
Network Using the Singular Value Decomposition - to quantify the 
significance of increasing the number of neurons in the hidden layer 
of a feedforward neural network architecture using the singular value 
decomposition (SVD).[15] 



Choi et al. 



Eu Jin Teoh, 
Cheng Xiang, 
Kay Chen Tan 
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2007 


Optimizing number of hidden neurons in neural networks - 
A method is proposed using, quantitative standard based on the 
Signal to Noise Ratio Figure (SNRF) to detect over fitting 
automatically using the training error only, to optimize the number of 
hidden neurons in Neural Network to avoid over fitting problem. [16] 


Yinyin Liu, 
Janusz A. Starzyk, 
Zhen Zhu 


2008 


presented the lower bound on the number of hidden neurons. The 
number of hidden neurons is N h =q" where q is valued upper bound 
function. The calculated values represent that the lower bound is 
tighter than the ones that has existed. The lower and upper bound on 
the number of hidden neurons help to design constructive learning 
algorithms. The lower bound can accelerate the learning speed, and 
the upper bound gives the stopping condition of constructive learning 
algorithms. It can be applied to the design of constructive learning 
algorithm with training set N numbers. [17] 


Jiang et al 


2009 


investigated the effect of learning stability and hidden neuron in 
neural network. The simulation results show that the hidden output 
connection weight becomes small as number of hidden neurons N h 
becomes large. This is implemented in random number mapping 
problems. The formula for hidden nodes is N h =VNjN 0 where N is 
the input neuron and N 0 is the output neuron. Since neural network 
has several assumptions which are given before starting the 
discussion to prevent divergent. In unstable models, number of 
hidden neurons becomes too large or too small. A trade-off is formed 
that if the number of hidden neurons becomes too large, output of 
neurons becomes unstable, and if the number of hidden neurons 
becomes too small, the hidden neurons becomes unstable again. [18] 


Shibata and JJseda 


2009 


Bernoulli Neural Network with Weights Directly Determined and 
with the Number of Hidden- Layer Neurons Automatically 
Determined Based upon polynomial interpolation and approximation 
theory, a special type of feedforward neural -network is constructed 
in this paper with hidden-layer neurons activated by Bernoulli 
polynomials to overcome the drawback of the conventional Back 
propagation neural network algorithm such as slow convergence and 
local minima existence. A weight detection algorithm is also 
proposed which detects weights of the neural network without the 
lengthy back propagation training procedures. More over a Structure 
Automatic Determination algorithm is proposed which can obtain the 
optimal number of hidden layers neurons in order of achieving the 
highest learning accuracy. [19] 


Yunong Zhang, 
Gongqin Ruan 


2010 


proposed a technique to find the number of hidden neurons in MLP 
network using coarse-to-fine search technique which is applied in 
skin detection. This technique includes binary search and sequential 
search. This implementation is trained by 30 networks and searched 
for lowest mean squared error. The sequential search is performed in 
order to find the best number of hidden neurons[20] 


Doukim et al. 


2010 


proposed the learning algorithms for determination of number of 
hidden neurons. [21] 


Wu and Hong 


2011 


Panchal proposed a methodology to analyze the behavior of MLP. 
The number of hidden layers is inversely proportional to the minimal 
error. [22] 


Panchal et al. 


2011 


Improving the character reorganization efficiency of feed forward 
BP Neural Network- Presented a method for improvement of 
character recognition capability of feed-forward back-propagation 
neural network by using one , two and three layers. Also it has been 
shown that with more hidden layers and modified momentum term a 
soft real time system can be introduced where performance is more 
critical. [23] 


Amit Choudhary , 
Rahul Rishi 


2011 


A Clustering Based Method to Stipulate the Number of Hidden 
Neurons of MLP Neural Networks: Applications in Pattern 
Recognition we propose a clustering based method to state the 
number of hidden neurons that can produce good classification error 
rate. Our cluster based method uses much less CPU time consuming 
than method [24] 


m.r. Silvestre, 
s.m. Oikawa2, 
f.h.t. Vieira, 
1.1. Ling 


2012 


Approximating Number of Hidden layer neurons in Multiple Hidden 
Layer BPNN Architecture 


Karsoliya Saurabh 
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The process of deciding the number of hidden layers and number of 
neurons in each hidden layer is still confusing. In this paper, an 
survey is made in order to resolved the problem of number of 
neurons in each hidden layer and the number of hidden layers 
required. [25] 




2012 


developed a method used in proper NN architectures. The 
advantages are the absence of trial-and-error method and 
preservation of the generalization ability. The three networks MLP, 
bridged MLP, and fully connected cascaded network are used. The 
implemented formula: as follows, N h =N+l for MLP Network, 
N h =2N+lfor bridged MLP Network and N h =2 n -1 for fully 
connected cascade NN. The experimental results show that the 
successive rate decreases with increasing parity number. The 
successive rate increases with number of neurons used. The result is 
obtained with 85% accuracy. [26] 


Hunter et al 



Some sources claim that one hidden layer of neurons between the input nodes and output nodes are enough to 
accurately approximate any continuous function. 

This is an extremely important finding because the computational complexity of the learning algorithm 
increases quadratically with the number of neurons in the network and this experiment shows that only very few 
neurons are necessary for this dataset. 

Comparison to Other Classifiers: Neural networks are even a good alternative to other classification 
algorithms such as decision trees, Bayesian network, Inductive logic programming and naive Bayes. The chart 
below shows the classification accuracy of neural networks compared to that of decision trees and naive Bayes. 
It seems that neural networks perform about the same as those two classifiers and often times produce a little 
better results than naive Bayes. Neural networks Can model more arbitrary functions (nonlinear interactions, 
etc.) and therefore might be more accurate, provided there is enough training data. But it can be prone to over- 
fitting as well. 

Layered, feed-forward, backpropagation neural networks: These are a class of ANNs whose neurons are 
organized in layers. The layers are normally fully connected, meaning that each element (neuron) of a layer is 
connected to each element of the next layer. However, self-organizing varieties also exist in which a network 
starts either with a minimal number of synaptic connections between the layers and adds to the number as 
training progresses (constructive), or starts as a fully connected network and prunes connections based on the 
data observed in training (destructive). 

Backpropagation [27] is a learning algorithm that, in its original version, belongs to the gradient descent 
optimization methods [28]. The combination of backpropagation learning algorithm and the feed-forward, 
layered networks provide the most popular type of ANNS. These ANNS have been applied to virtually all 
pattern recognition problems, and are typically the first networks tried on a new problem. The reason for this is 
the simplicity of the algorithm, and the vast body of research that has studied these networks. As such, in 
sequencing, many researchers have also used this type of network as a first line of attack. 
Examples can be mentioned in [29,30]. In [29] Wu has developed a system called gene classification artificial 
neural system (GenCANS), which is based on a three layered, feed-forward backpropagation network. 
GenCANS was designed to "classify new (unknown) sequences into predefined (known) classes. It involves two 
steps, sequence encoding and neural network classification, to map molecular sequences (input) into gene 
families (output)". In [30] the same type of network has been used to perform rapid searches for sequences of 
proteins. 

Other examples can be mentioned in Snyder and Stormo's work in designing a system called 
GeneParser.[31] Here authors experimented with two variations of a single layer network (one fully connected, 
and one partially connected with an activation bios added to some inputs), as well as a partially connected two- 
layer network. The authors use dynamic programming as the learning algorithm in order to train the system for 
protein sequencing. 

1 ANNS were the first group of machine learning algorithm to be used on a biological pattern recognition 
problem. [32] 

2 Distinguish between sporadic colorectal adenomas and cancers (SAC), and inflammatory bowel disease 
(IBD)-associated dysplasias and cancers. 
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Self-organizing neural networks: These networks are a very large class of neural networks whose structure 
(number of neurons, number of synaptic connections, number of modules, or number of layers) changes during 
learning based on the observed data. There are two classes of this type of networks: destructive and constructive. 
Destructive networks are initially a fully connected topology and the learning algorithm prunes synapses 
(sometime entire neurons, modules, or layers) based on the observed data. The final remaining network after 
learning is complete, usually is a sparsely connected network. Constructive algorithms start with a minimally 
connected network, and gradually add synapses (neurons, modules, or layers) as training progresses, in order to 
accommodate for the complexity of the task at hand. 

Self-organizing networks divide into some major and very well known groups. Bellow we summarize the 
application some of these major groups in sequencing and expression level data analys is Self-Organizing Map: 
A self-organizing map (SOM) [33] is a type of neural network approach first proposed by Kohonen [34]. SOMs 
have been used as a divisive clustering approach in many areas including genomics. Several groups have used 
SOMs to discover patterns in gene expression data. Among these Toronen et al. [35], Golub et al. [36], and 
Tamayo et al. [37] can be mentioned. Tamayo and colleagues [37] use self-organizing maps to explore patterns 
of gene expression generated using Affymetrix arrays, and provide the GENECLUSTER implementation of 
SOMs. Tamayo and his colleagues explain the implementation of SOM for expression level data as follows: 
"An SOM has a set of nodes with a simple topology and a distance function on the nodes. Nodes are 
iteratively mapped into k-dimensional "gene expression space" (in which the i coordinate represents the 
expression level in the i sample)" [38]. A SOM assigns genes to a series of partitions on the basis of the 
similarity of their expression vectors to reference vectors that are defined for each partition. The summary of the 
basic SOM algorithm is perhaps best described in Quackenbush's review [39]: "First, random vectors are 
constructed and assigned to each partition. Second, a gene is picked at random and, using a selected distance 
metric, the reference vector that is closest to the gene is identified. Third, the reference vector is then adjusted so 
that it is more similar to the vector of the assigned gene. The reference vectors that are nearly on the two- 
dimensional gird are also adjusted so that they are more similar to the vector of the assigned gene. Fourth, steps 
2 and 3 are iterated several thousand times, decreasing the amount by which the reference vectors are adjusted 
and increasing the stringency used to define closeness in each step. As the process continues, the reference 
vectors converge to fixed values. Last, the genes are mapped to the relevant partitions depending on the 
reference vector to which they are most similar". 

SOMs, like the gene shaving approach [40], have the distinct advantage that they allow a priori knowledge to be 
included in the clustering process. Tamayo et. al. explain this and other advantages of the SOM approach as 
follows: "The SOM has a number of features that make them particularly well suited to 

clustering and analyzing gene expression patterns. They are ideally suited to exploratory data analysis, allowing 
one to impose partial structure on the clusters (in contrast to the rigid structure of hierarchical clustering, the 
strong prior hypotheses used in Bayesian clustering, and the nonstructural of k-means clustering) facilitating 
easy visualization and interpretation. SOMs have good computational properties and are easy to implement, 
reasonably fast, and are scalable to large data sets". 

The most prominent disadvantage of the SOM approach is that it is difficult to know when to stop the algorithm. 
If the map is allowed to grow indefinitely, the size of SOM is gradually increased to a point when clearly 
different sets of expression patterns are identified. Therefore, as with k-means clustering, the user has to rely on 
some other source of information, such as PCA, to determine the number of clusters that best represents the 
available data. For this reason, Sasik and his colleagues believe that "SOM, as implemented by Tamayo et al is 
essentially a restricted version of k-means: Here, the k clusters are linked by some arbitrary user-imposed 
topological constraints (e.g. a 3 x 2 grid), and as such suffers from all of the problems mentioned above for k- 
means (and more), except that the constraints expedites the optimization process". [37] 

There are many varieties to SOM, among which the self-organizing feature maps (SOFM) should be mentioned. 
The growing cell structure (GCS) [41] is another derivative of SOFM. It is a self organizing and incremental 
(constructive) neural learning approach. The unsupervised variety of GCS was used by Azuaje [42] for 
discovery of gene expression patterns in B-cell lymphoma. Using cDNA microarray expression data, GCS was 
able to "identify normal and diffuse large B-cell lymphoma (DLBCL) patients. Furthermore, it distinguishes 
patients with molecularly distinct types of DLBCL without previous knowledge of those subclasses". 

Self-organizing trees: Self-organizing trees are normally constructive neural network methods that develop 
into a tree (usually binary tree) topology during learning. Among examples of these networks applied to 
sequencing the work of Dopazo et al. [43], Wang et al. [44,45], and Herrero et al. [46] can be mentioned. 
Dopazo and Carazo introduce the self-organizing tree algorithm (SOTA). [43] SOTA is a hierarchical neural 
network that grows into a binary tree topology. For this reason SOTA can be considered a hierarchical 
clustering algorithm [41]. SOTA is based on Kohonen's SOM discussed above and Fritzke's growing cell [34]. 
SOTA was originally designed to analyze pre-aligned sequences of genes (polygenetic reconstruction). 
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Wang and colleagues extended that work in [43,44] in order to classify protein sequences: "(SOTA) is now 
adapted to be able to analyze patterns associated to the frequency of residues along a sequence, such as protein 
dipeptide composition and other n-gram compositions". In [46], Herrero and colleagues demonstrate the 
application of SOTA to Microarray expression data. The show that SOTA's performance is superior to that of 
classical hierarchical clustering techniques. Among the advantages of SOTA as compared to hierarchical cluster 
algorithms, one can mention its time complexity, and its top-to-bottom hierarchical approach. SOTA's runtimes 
are approximately linear with the number of items to be classified, making it suitable for large datasets. Also, 
because SOTA forms higher clusters in the hierarchy before forming the lower clusters, it can be stopped at any 
level of hierarchy and still produce meaningful intermediate results. 

There are many other types of self-organizing trees that for space considerations cannot be mentioned here. 
SOTA was only chosen here to give a brief introduction to these types of networks, and not because we believe 
that these are the most common types of self-organizing trees. 

ART and its derivatives: Adaptive Resonance Theory was introduced by Stephen Grossberg [47] in 
1976. Networks designed based on ART are unsupervised and self-organizing, and only learn in the socalled 
"resonant" state. ART can form (stable) clusters of arbitrary sequences of input patterns by learning (entering 
resonant states) and self-organizing. Since the inception, many derivatives of ART have emerged. Among these 
ART-1 (the binary version of ART; forms clusters of binary input data) [48], A/?r-2(analog version of ART) 
[49], ART-2A (fast version of ART-2) [50], ARTS (includes "chemical transmitters" to control the search 
process in a hierarchical ART structure) [50], ARTMAP (supervised version of ART) [51] can be mentioned. 
Many hybrid varieties such as Fuzzy-ART [52], Fuzzy-ARTMAP (supervised Fuzzy-ART) [52] and simplified 
Fuzzy-ARTMAP (SFAM) [51] have also been developed. 

The ART family of networks has a broad application in virtually all areas of pattern recognition. As such, they 
have also been applied in biological sequencing problems. In general, in problem settings where the number of 
clusters is not previously known, researchers tend to use unsupervised ART, where when number of clusters is 
known a priori, usually the supervised version, ARTMAP, is used. 

Among the unsupervised implementations, the work of Tomida et al. [53] should be mentioned. Here the 
authors used Fuzzy ART for expression level data analysis. This study is significant in that it compares results 
of Fuzzy ART with those of hierarchical clustering, k-mean clustering, and SOMs in analyzing and sequencing 
expression level data on the genomic scale. In this study the authors consider a two dimensional problem space 
in which the first dimension is the expression level and the second dimension is time, in order to consider the 
variation of expression levels in time. In this study, the authors report the lowest gap index3 among the four 
algorithms considered for Fuzzy ART. In terms of robustness of clustering results, the authors report a 
robustness of 82.2% for Fuzzy ART versus 71.1%, 44.4% and 46.7%, for hierarchical clustering, fc-means 
clustering, and SOMs respectively. As a result the authorsconclude that Fuzzy ART also performs best in noisy 
data. 

Among supervised implementations, Azuaje's use of SFAM in cDNA data analysis with the purpose of 
discovery of gene function in cancer can be mentioned. [54] In this study the authors use a data set generated for 
a previous study [18] using SOFM. Here the authors were able to also distinguish between classes of cancer 
using SFAM instead of SOFM (reported earlier in this paper). 

Other neural network techniques 

Molecular computers and molecular neural networks: The idea of molecular and cellular computing dates 
back to the late 1950's when Richard Feynman delivered his famous paper describing "sub-microscopic" 
computers. He suggested that living cells and molecules could be thought of as potential computational 
components. Along the same lines, a revolutionary form of computing was developed by Leonard 
Aldeman in 1994, when he announced that he had been able to solve a small instance of a computationally 
intractable problem using a small vial of DNA. [55] Aldeman had accomplished this by representing 
information as sequences of bases in DNA molecules. He then showed how to exploit the self assembly property 
of DNA and use DNA-manipulation techniques to implement a simple but massively parallel random search. 
As a result, it has been suggested byMills [56] (among others) that a "DNA computer using cDNA as input 
might be ideal for clinical cell discrimination". In particular, Mills recommends the neural networks version of 
Aldeman's DNA computing scheme developed also by Mills and his colleagues [56] to the classification and 
sequencing tasks involved in expression level profiling. Mills reports in [56] that his "preliminary experimental 
results suggest that expression profiling should be feasible using a DNA neural network that acts directly on 
cDNA". 

The DNA neural networks are the most recent and innovative ANN solution suggested for expression data 
profiling and sequencing. Its massively parallel nature offers exceedingly fast computation. The method presents 
high potential in various aspects of the solution that it offers and should offer breakthrough advancement to the 
field. 
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Bayesian Neural Networks: There are a number of recent networks that have been suggested as solutions for 
sequencing and expression data analysis. For instance, Bayesian neural networks (BNNs), are another technique 
that has been recently used for gene expression analysis. Liang et al. [57] have used the BNNs with structural 
learning for exploring microarray data in gene expressions. The BNNs are an important addition to the host of 
ANN solutions that have been offered to the problem at hand, as they represent a 

large group of hybrid ANNs that combine classical ANNs with statistical classification and prediction theories. 

CONCLUSION 

There are a number of ANN solutions that have been offered to the problem of biological sequencing and 
expression data analysis. While many researchers have applied "classical" ANN solutions to the problem and 
are reporting favorable results as compared to non -adaptive counter parts, others have suggested or developed 
pioneering and revolutionary ANNs. Many of these techniques offer feasible solutions to the 
above mentioned problems, while some also offer significant speedup in achieving the solutions. We believe 
that the performance of a technique should ultimately be measured in the biological soundness and relevance of 
its results. As such, we believe that techniques that allow the incorporation of a priori biological knowledge 
show greater potential. 

3 In this study, the authors use gap index to "evaluate the similarity of profiles as area between each profile and 
average profile during temporal phase". This review represents only a small part of the research being conducted 
in the area, and only is meant as a complementary/continuation of the survey that others have conducted in this 
area. It should in no way to be taken as a complete survey of all algorithms. If not for any other reason, for the 
reason of limited space, some significant developments in the area had to be left out. Furthermore, new 
techniques and algorithm are being proposed for microarray data analysis on a daily basis, making survey 
articles such as this highly time dependent. 
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